Interestingness is Not a Dichotomy: Introducing Softness in Constrained Pattern Mining

Authors

  • Stefano Bistarelli
  • Francesco Bonchi
Abstract

The paradigm of pattern discovery based on constraints was introduced to give the user a tool for driving the discovery process towards potentially interesting patterns, with the positive side effect of more efficient computation. So far, research on this paradigm has mainly focused on the latter aspect: the development of efficient algorithms for evaluating constraint-based mining queries. Because methodological issues have received little attention, the constraint-based pattern mining framework still suffers from many problems that limit its practical relevance. As a solution, in this paper we introduce the new paradigm of pattern discovery based on Soft Constraints. Albeit simple, the proposed paradigm overcomes all the major methodological drawbacks of the classical constraint-based paradigm, representing an important step towards practical pattern discovery.

1 Background and Motivations

During the last decade many researchers have focused their (mainly algorithmic) investigations on the computational problem of Frequent Pattern Discovery, i.e., mining patterns that satisfy a user-defined constraint of minimum frequency [1]. The simplest form of a frequent pattern is the frequent itemset.

Definition 1 (Frequent Itemset Mining). Let I = {x1, ..., xn} be a set of distinct items, where an item is an object with some predefined attributes (e.g., price, type, etc.). An itemset X is a non-empty subset of I. A transaction database D is a bag of itemsets t ∈ 2^I, usually called transactions. The support of an itemset X in database D, denoted supp_D(X), is the number of transactions that are supersets of X. Given a user-defined minimum support σ, an itemset X is called frequent in D if supp_D(X) ≥ σ. This defines the minimum frequency constraint: C_freq[D,σ](X) ⇔ supp_D(X) ≥ σ.
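Definition 1 can be made concrete with a small sketch. The code below is a brute-force illustration, not an algorithm from the paper: it enumerates all of 2^I, which is only viable for toy inputs. The function names (`support`, `frequent_itemsets`) and the toy database `D` are assumptions introduced here for illustration.

```python
from itertools import combinations


def support(itemset, transactions):
    """supp_D(X): number of transactions that are supersets of the itemset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def frequent_itemsets(transactions, min_support):
    """Return {X: supp_D(X)} for every non-empty itemset X with supp_D(X) >= min_support.

    Brute force over all non-empty subsets of I; real miners prune this search space.
    """
    universe = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(universe) + 1):
        for candidate in combinations(universe, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result


# Hypothetical toy database D (a bag of transactions), not taken from the paper.
D = [frozenset(t) for t in (
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
)]

frequent = frequent_itemsets(D, min_support=3)
for itemset, supp in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), supp)
```

With σ = 3 on this toy database, every single item and every pair is frequent, while {a, b, c} (support 2) is not.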
Recently the research community has turned its attention to more complex kinds of frequent patterns extracted from more structured data: sequences, trees, and graphs. These kinds of patterns differ in their peculiarities and application fields, but they all share the same computational aspects: a usually very large input, an exponential search space, and an overly large solution set. This situation (too much data yielding too many patterns) is harmful for two reasons. First, performance degrades: mining generally becomes inefficient or, often, simply unfeasible. Second, it is difficult to identify the fragments of interesting knowledge, blurred within a huge quantity of mostly useless patterns.

The paradigm of constraint-based pattern mining was introduced as a solution to both these problems. In this paradigm, it is the user who specifies to the system what is interesting for the current application: constraints are a tool to drive the mining process towards potentially interesting patterns; moreover, they can be pushed deep inside the mining algorithm in order to fight the curse of the exponential search space and to achieve better performance [15, 20, 25]. When instantiated to the pattern class of itemsets, the constraint-based pattern mining problem is defined as follows.

Definition 2 (Constrained Frequent Itemset Mining). A constraint on itemsets is a function C : 2^I → {true, false}. We say that an itemset X satisfies a constraint if and only if C(X) = true. We define the theory of a constraint as the set of itemsets which satisfy the constraint: Th(C) = {X ∈ 2^I | C(X)}. With this notation, the frequent itemset mining problem requires computing the set of all frequent itemsets Th(C_freq[D,σ]). In general, given a conjunction of constraints C, the constrained frequent itemset mining problem requires computing Th(C_freq) ∩ Th(C).

Example 1.
The following is an example mining query:

Q : supp_D(X) ≥ 1500 ∧ avg(X.weight) ≤ 5 ∧ sum(X.price) ≥ 20

It requires mining, from database D, all patterns which are frequent (have a support of at least 1500), have an average weight of at most 5 and a sum of prices of at least 20.

So far constraint-based frequent pattern mining has been seen as a query optimization problem, i.e., developing efficient, sound and complete evaluation strategies for constraint-based mining queries. In other terms, the goal has been to design efficient algorithms that mine all and only the patterns in Th(C_freq) ∩ Th(C). To this aim, properties of constraints have been studied comprehensively, and on the basis of such properties (e.g., anti-monotonicity, succinctness [20, 18], monotonicity [11, 17, 6], convertibility [22], loose anti-monotonicity [9]), efficient computational strategies have been defined. Despite this effort, the constraint-based pattern mining framework still suffers from many problems which limit its practical relevance.

First of all, consider the example mining query Q given above: where do the three thresholds (i.e., 1500, 5 and 20) come from? In some cases they can be precisely imposed by the application, but this is rare. In most cases, they come from an exploratory mining process, where they are iteratively adjusted until a solution set of reasonable size is produced. This practical way of proceeding is in contrast with the basic philosophy of the constraint-based paradigm: constraints should represent what is a priori interesting, given the application background knowledge, rather than be adjusted according to a preconceived output size. Another major drawback of the constraint-based pattern mining paradigm is its rigidity.
Consider, for instance, the following three patterns (we use the notation 〈v1, v2, v3〉 to denote the three values corresponding to the three constraints in the conjunction in the example query Q): p1 : 〈1700, 0.8, 19〉, p2 : 〈1550, 4.8, 54〉, and p3 : 〈1550, 2.2, 26〉. The first pattern, p1, largely satisfies two of the three given constraints, while it slightly violates the third one. According to the classical constraint-based pattern mining paradigm, p1 would be discarded as not interesting. Is such a pattern really less interesting than p2 and p3, which satisfy all three constraints but are much less frequent than p1? Moreover, is it reasonable, in real-world applications, that all constraints are equally important?

All these problems stem from the same source: the fact that in the classical constraint-based mining framework, a constraint is a function which returns a boolean value, C : 2^I → {true, false}. Indeed, interestingness is not a dichotomy. This consideration suggests a simple solution to overcome all the main drawbacks of the constraint-based paradigm.

Paper Contributions and Organization

In this paper, as a means to handle interestingness [26, 16, 24], we introduce the soft constraint based pattern mining paradigm, where constraints are no longer rigid boolean functions, but are “soft” functions, i.e., functions with values in a set A, which represents the set of interest levels or costs assigned to each pattern.

– The proposed paradigm is not rigid: a potentially interesting pattern is not discarded for just a slight violation of a constraint.
– Our paradigm creates an order of patterns w.r.t. interestingness (level of constraint satisfaction): this makes it possible to say that a pattern is more interesting than another, instead of strictly dividing patterns into interesting and not interesting.
– From the previous point it follows that our paradigm makes it possible to express top-k queries based on constraints: the data analyst can ask for the top-10 patterns w.r.t. a given description (a conjunction of soft constraints).
– Alternatively, we can ask the system to return all and only the patterns which exhibit an interest level larger than a given threshold λ.
– The proposed paradigm makes it possible to assign different weights to different constraints, while in the classical constraint-based pattern discovery paradigm all constraints were equally important.
– Last but not least, our idea is very simple and thus very general: it can be instantiated to different classes of patterns such as itemsets, sequences, trees or graphs.

For the reasons listed above, we believe that the proposed paradigm represents an important step towards practical pattern discovery. A nice feature of our proposal is that, by adopting the soft constraint based paradigm, we do not reject the research results obtained in the classical constraint-based paradigm; on the contrary, we fully exploit such algorithmic results. In other terms, our proposal is merely methodological, and it exploits previous research results that were mainly computational.

The paper is organized as follows. In the next Section we briefly review the theory of soft constraints and we define the soft constraint based pattern mining paradigm. In Section 3 we discuss possible alternative instances of the paradigm. In Section 4 we formally define the Soft Constraint Based Pattern Discovery paradigm. We then focus on one of the many possible instances of the proposed paradigm, and we implement it in a concrete Pattern Discovery System. Such a system is built as a wrapper around a classical constraint pattern mining system.

2 Introducing Soft Constraints

Constraint Solving is an emerging software technology for declarative description and effective solving of large problems.
Many real-life systems, ranging from network management [14] to complex scheduling [2], are analyzed and solved using constraint-related technologies. The constraint programming process consists of the generation of requirements (constraints) and the solution of these requirements by specialized constraint solvers. When the requirements of a problem are expressed as a collection of boolean predicates over variables, we obtain what is called the crisp (or classical) Constraint Satisfaction Problem (CSP). In this case the problem is solved by finding any assignment of the variables that satisfies all the constraints.

Sometimes, when a deeper analysis of a problem is required, soft constraints are used instead. Several formalizations of the concept of soft constraints are currently available. In the following, we refer to the formalization based on c-semirings [5]: a semiring-based constraint assigns to each instantiation of its variables an associated value from a partially ordered set. When dealing with crisp constraints, the values are the booleans true and false, representing the admissible and non-admissible values respectively; when dealing with soft constraints, the values are interpreted as preferences or costs. The framework must also handle the combination of constraints. To do this one must take such additional values into account, and thus the formalism must provide suitable operations for combination (×) and comparison (+) of tuples of values and constraints. This is why this formalization is based on the mathematical concept of semiring.

Definition 3 (c-semirings [5, 3]). A semiring is a tuple 〈A,+,×,0,1〉 such that: A is a set and 0,1 ∈ A; + is commutative, associative and 0 is its unit element; × is associative, distributes over +, 1 is its unit element and 0 is its absorbing element. A c-semiring (“c” stands for “constraint-based”) is a semiring 〈A,+,×,0,1〉 such that + is idempotent with 1 as its absorbing element and × is commutative.
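The axioms of Definition 3 can be checked mechanically on a finite sample of A. The following is a minimal sketch under assumptions: the class `CSemiring`, the helper `satisfies_axioms`, and the two instances are names introduced here for illustration, not part of the paper. The boolean instance 〈{true, false}, ∨, ∧, false, true〉 corresponds to crisp constraints; ordinary arithmetic on the naturals is included as a semiring that is not a c-semiring, because its + is not idempotent.

```python
from dataclasses import dataclass
from itertools import product
from typing import Any, Callable


@dataclass(frozen=True)
class CSemiring:
    """A candidate structure <A, +, x, 0, 1>; A itself is left implicit."""
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]
    zero: Any
    one: Any


def satisfies_axioms(s: CSemiring, sample) -> bool:
    """Check the c-semiring axioms of Definition 3 on a finite sample of A.

    Passing on a sample is evidence, not a proof, unless the sample is all of A.
    """
    for a, b, c in product(sample, repeat=3):
        ok = (
            s.plus(a, b) == s.plus(b, a)                                  # + commutative
            and s.plus(s.plus(a, b), c) == s.plus(a, s.plus(b, c))        # + associative
            and s.plus(a, s.zero) == a                                    # 0 unit of +
            and s.times(s.times(a, b), c) == s.times(a, s.times(b, c))    # x associative
            and s.times(a, s.plus(b, c))
                == s.plus(s.times(a, b), s.times(a, c))                   # x distributes over +
            and s.times(a, s.one) == a                                    # 1 unit of x
            and s.times(a, s.zero) == s.zero                              # 0 absorbing for x
            and s.plus(a, a) == a                                         # + idempotent
            and s.plus(a, s.one) == s.one                                 # 1 absorbing for +
            and s.times(a, b) == s.times(b, a)                            # x commutative
        )
        if not ok:
            return False
    return True


# Boolean structure used for crisp constraints.
S_BOOL = CSemiring(plus=lambda a, b: a or b, times=lambda a, b: a and b,
                   zero=False, one=True)

# Ordinary arithmetic: a semiring, but + is not idempotent (1 + 1 != 1).
ARITH = CSemiring(plus=lambda a, b: a + b, times=lambda a, b: a * b,
                  zero=0, one=1)

print(satisfies_axioms(S_BOOL, [True, False]))
print(satisfies_axioms(ARITH, [0, 1, 2]))
```

Since {true, false} is the whole carrier set of the boolean structure, the check is exhaustive in that case.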
Definition 4 (soft constraints [5, 3]). Given a c-semiring S = 〈A,+,×,0,1〉 and an ordered set of variables V over a finite domain D, a constraint is a function which, given an assignment η : V → D of the variables, returns a value of the c-semiring. Using this notation, we define C = η → A as the set of all possible constraints that can be built starting from S, D and V.

In the following we will always use the word semiring as standing for c-semiring, and we will explain this very general concept from the point of view of pattern discovery.

Example 2. Consider again the mining query Q. In this context the ordered set of variables V is 〈supp_D(X), avg(X.weight), sum(X.price)〉, while the domain D is: D(supp_D(X)) = N, D(avg(X.weight)) = R, and D(sum(X.price)) = N. If we consider the classical crisp framework (i.e., hard constraints) we have the semiring S_Bool = 〈{true, false},∨,∧, false, true〉. A constraint C is a function V → D → A; for instance, supp_D(X) → 1700 → true.

The + operator is what we use to compare tuples of values (or patterns, in our context). Let us consider the relation ≤S (where S stands for the specified semiring) over A such that a ≤S b iff a + b = b. It is possible to prove that: ≤S is a partial order; + and × are monotone on ≤S; 0 is its minimum and 1 its maximum; and 〈A,≤S〉 is a complete lattice with least upper bound operator +. In the context of pattern discovery, a ≤S b means that the pattern b is more interesting than a, where interestingness is defined by a combination of soft constraints. When using (soft) constraints it is necessary to specify, via suitable combination operators, how the level of interest of a combination of constraints is obtained from the interest level of each constraint. The combined weight (or interest) of a combination of constraints is computed by using the operator ⊗ : C × C → C defined as (C1 ⊗ C2)η = C1η ×S C2η.

Example 3.
If we adopt the classical crisp framework, in the mining query Q of Example 1 we have to combine the three constraints using the ∧ operator (which is the × in the boolean semiring S_Bool). Consider for instance the pattern p1 : 〈1700, 0.8, 19〉 for the ordered set of variables V = 〈supp_D(X), avg(X.weight), sum(X.price)〉. The first and the second constraint are satisfied, leading to the semiring level true, while the third one is not satisfied and has associated level false. Combining the three values with ∧ we obtain true ∧ true ∧ false = false, and we can conclude that the pattern 〈1700, 0.8, 19〉 is not interesting w.r.t. our purposes. Similarly, we compute level true for pattern p3 : 〈1550, 2.2, 26〉, which is therefore interesting w.r.t. our goals. Notice that with crisp constraints, the order between values only says that we are interested in patterns with semiring level true and not interested in patterns with semiring level false (that is, false ≤S_Bool true).

3 Instances of the Semiring

Dividing patterns into interesting and non-interesting is sometimes neither meaningful nor useful. Most of the time we can say that each pattern is interesting with a specific level of preference. Soft constraints can deal with preferences by moving from the two-valued semiring S_Bool to other semirings able to give a finer distinction among patterns (see [3] for a comprehensive guide to the semiring framework). For our scope the fuzzy and the weighted semiring are the most suitable.

Example 4 (fuzzy semiring). When using the fuzzy semiring [12, 23], each constraint-pattern pair is assigned an interest level between 0 and 1, where 1 represents the best value (maximum interest) and 0 the worst one (minimum interest). Therefore the + in this semiring is given by the max operator, and the order ≤S is given by the usual ≤ on real numbers. The value associated to a pattern is obtained by combining the constraints using the minimum operator among the semiring values.
Therefore the × in this semiring is given by the min operator. To recapitulate, the fuzzy semiring is given by S_F = 〈[0, 1],max,min, 0, 1〉. The rationale for such a max-min framework is the attempt to maximize the value of the least preferred tuple. Fuzzy soft constraints are able to model partial constraint satisfaction [13], so as to obtain a solution even when the problem is overconstrained, and also prioritized constraints, that is, constraints with different levels of importance [10]. Figure 1 reports graphical representations of possible fuzzy instances of the constraints in Q. Consider, for instance, the graphical representation of the frequency constraint in Figure 1(C1). The dotted line describes the behavior
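To make the fuzzy semiring S_F = 〈[0, 1], max, min, 0, 1〉 concrete, the sketch below scores the three patterns p1, p2 and p3 against soft versions of the constraints in Q. The shapes in Figure 1 are not reproduced in this excerpt, so the piecewise-linear membership functions and their breakpoints (1200 to 1800 for support, 3 to 7 for average weight, 10 to 30 for price sum) are purely hypothetical choices made here for illustration; they are not the paper's instances. Only the combination by min and the ordering by ≤ on [0, 1] follow the text.

```python
def ramp_up(x, lo, hi):
    """0 at or below lo, 1 at or above hi, linear in between."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def ramp_down(x, lo, hi):
    """1 at or below lo, 0 at or above hi, linear in between."""
    return 1.0 - ramp_up(x, lo, hi)


# Hypothetical soft versions of the three constraints of Q (breakpoints are assumptions).
SOFT_CONSTRAINTS = (
    lambda supp: ramp_up(supp, 1200, 1800),      # soft "supp_D(X) >= 1500"
    lambda weight: ramp_down(weight, 3, 7),      # soft "avg(X.weight) <= 5"
    lambda price: ramp_up(price, 10, 30),        # soft "sum(X.price) >= 20"
)


def interest(pattern):
    """Combine the per-constraint levels with the fuzzy semiring's x, i.e. min."""
    return min(c(v) for c, v in zip(SOFT_CONSTRAINTS, pattern))


# Patterns from the text, as <supp, avg weight, sum of prices>.
patterns = {
    "p1": (1700, 0.8, 19),
    "p2": (1550, 4.8, 54),
    "p3": (1550, 2.2, 26),
}

# Order patterns by interest level (the usual <= on [0, 1]), best first.
ranking = sorted(patterns, key=lambda name: interest(patterns[name]), reverse=True)

# Threshold query: patterns whose interest level is at least lambda.
LAMBDA = 0.5
above_threshold = [name for name in ranking if interest(patterns[name]) >= LAMBDA]

for name in ranking:
    print(name, round(interest(patterns[name]), 3))
```

Under these assumed membership functions, p1 is no longer discarded outright: it receives a non-zero interest level and takes part in the ranking, and both a top-k query (a prefix of `ranking`) and a threshold query (`above_threshold`) can be answered from the same scores.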

Similar resources

Soft constraint based pattern mining

The paradigm of pattern discovery based on constraints was introduced with the aim of providing to the user a tool to drive the discovery process towards potentially interesting patterns, with the positive side effect of achieving a more efficient computation. So far the research on this paradigm has mainly focused on the latter aspect: the development of efficient algorithms for the evaluation...

Pii: S0950-7051(99)00019-2

This paper discusses several factors influencing the evaluation of the degree of interestingness of rules discovered by a data mining algorithm. This article aims at: (1) drawing attention to several factors related to rule interestingness that have been somewhat neglected in the literature; (2) showing some ways of modifying rule interestingness measures to take these factors into account; (3)...

Knowledge actionability: satisfying technical and business interestingness

Traditionally, knowledge actionability has been investigated mainly by developing and improving technical interestingness. Recently, initial work on technical subjective interestingness and business-oriented profit mining presents general potential, while it is a long-term mission to bridge the gap between technical significance and business expectation. In this paper, we propose a two-way sign...

On rule interestingness measures

This paper discusses several factors influencing the evaluation of the degree of interestingness of rules discovered by a data mining algorithm. The main goals of this paper are: (1) drawing attention to several factors related to rule interestingness that have been somewhat neglected in the literature; (2) showing some ways of modifying rule interestingness measures to take these factors into ...

Implicit Enumeration of Patterns

Condensed representations of pattern collections have been recognized to be important building blocks of inductive databases, a promising theoretical framework for data mining, and recently they have been studied actively. However, there has not been much research on how condensed representations should actually be represented. In this paper we study implicit enumeration of patterns, i.e., how ...

Numeric Multi-Objective Rule Mining Using Simulated Annealing Algorithm

Abstract as a single objective one. Measures like support, confidence and other interestingness criteria which are used for evaluating a rule, can be thought of as different objectives of association rule mining problem. Support count is the number of records, which satisfies all the conditions that exist in the rule. This objective represents the accuracy of the rules extracted from the da...

Publication date: 2005